Machine Learning Project

DATS6202

Nhung Nguyen & Winnie Hu

Introduction

To reduce Barcelona’s accident rate, it is informative to learn where and when accidents occur most often. It is also useful to explore the correlation between accidents and other factors, including transportation type and the unemployment rate. By analyzing location, time, and other factors that correlate with accidents, this project suggests possible ways to prevent accidents in Barcelona.

The link to our dataset: https://www.kaggle.com/xvivancos/barcelona-data-sets#unemployment.csv

Research Question

This project explores the determinants of traffic accidents in Barcelona. In particular, we want to study whether location, transportation type, time, and the unemployment rate are correlated with traffic accidents.

Data Preprocessing

The three datasets were combined using Python. The transportation dataset was cleaned, and each transportation type was converted into a dummy feature. For the unemployment dataset, we calculated the sum of registered unemployed individuals over one year for each district in Barcelona. We then merged the unemployment figures with the accident dataset using District Name as the key. Because the dataset contains a large number of samples, we dropped all rows with NA values. Finally, we converted district name, weekday, and month into dummy variables as well.
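The preprocessing steps above can be sketched with pandas as follows. The frames and column names here are illustrative stand-ins for the Kaggle CSVs, not the actual schema:

```python
import pandas as pd

# Hypothetical frames standing in for the Kaggle CSVs; column names are assumptions
accidents = pd.DataFrame({
    'District Name': ['Eixample', 'Gràcia', 'Eixample'],
    'Weekday': ['Monday', 'Friday', 'Monday'],
    'Month': ['June', 'July', 'June'],
    'Mild injuries': [1, 0, 2],
})
unemployment = pd.DataFrame({
    'District Name': ['Eixample', 'Gràcia', 'Eixample'],
    'Number': [700, 800, 500],
})

# Sum registered unemployed individuals over the year per district
unemp_by_district = (unemployment
                     .groupby('District Name', as_index=False)['Number']
                     .sum()
                     .rename(columns={'Number': 'Unemployment'}))

# Merge on District Name, then drop rows with missing values
merged = accidents.merge(unemp_by_district, on='District Name', how='left').dropna()

# One-hot encode district, weekday, and month
merged = pd.get_dummies(merged, columns=['District Name', 'Weekday', 'Month'])
```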

Methodology

We created a dataset with a categorical target representing the severity of accidents, based on the counts of mild and serious injuries.
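One way to construct such a target is sketched below; the column names `Mild injuries`, `Serious injuries`, and `Severity`, and the rule of labeling an accident serious when any serious injury was recorded, are assumptions for illustration:

```python
import pandas as pd

# Hypothetical accident records; column names are assumptions
df = pd.DataFrame({'Mild injuries': [1, 0, 2, 0],
                   'Serious injuries': [0, 1, 0, 0]})

# Label an accident 1 (serious) if any serious injury was recorded, else 0 (mild)
df['Severity'] = (df['Serious injuries'] > 0).astype(int)
```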

In this project, we want to focus on the correlations between traffic accidents and several non-human factors, including location, transportation type, time, and the unemployment rate. Therefore, we manually selected a set of features based on our research interests before running the models.

In particular, we want to study how traffic accidents behave during the summer (June–August) and at locations served by common transportation types, including Airport train, Cableway, Maritime station, Railway (FGC), and Underground.
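A minimal sketch of that feature selection step, assuming the dummy-column names produced during preprocessing (`Month_June`, `Airport train`, etc., plus a `Severity` target) follows; the frame here is a toy stand-in:

```python
import pandas as pd

# Hypothetical merged frame; column names mirror the dummy features assumed earlier
df = pd.DataFrame({
    'Month_June': [1, 0], 'Month_July': [0, 1], 'Month_August': [0, 0],
    'Airport train': [1, 0], 'Underground': [0, 1],
    'Unemployment': [1200, 800], 'Severity': [0, 1],
})

# Keep only summer months and the transportation types of interest
summer_cols = ['Month_June', 'Month_July', 'Month_August']
transport_cols = ['Airport train', 'Cableway', 'Maritime station',
                  'Railway (FGC)', 'Underground']
feature_cols = [c for c in summer_cols + transport_cols if c in df.columns]

X, y = df[feature_cols + ['Unemployment']], df['Severity']
```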

Feature importance

In [6]:
# Draw a bar plot of feature importances (f_importances is a DataFrame
# with 'Features' and 'Importance' columns; requires matplotlib.pyplot as plt)
h = f_importances.plot(x='Features', y='Importance', kind='bar', figsize=(16, 9), rot=80, fontsize=20)
# Label the axes and show the plot
h.set_ylabel('Importance', fontsize=18)
h.set_xlabel('Features name', fontsize=18)
plt.tight_layout()
plt.show()

Models

Models using hyperparameter tuning (adapted from exercise 7)

In [14]:
# Sort best_score_param_estimators in descending order of the best_score_

best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x : x[0], reverse=True)

# For each [best_score_, best_params_, best_estimator_]
for best_score_param_estimator in best_score_param_estimators:
    # Print out [best_score_, best_params_, best_estimator_]
    # Since best_estimator_ is a pipeline, we only print the type of its classifier step
    print([best_score_param_estimator[0], best_score_param_estimator[1], type(best_score_param_estimator[2].named_steps['clf'])], end='\n\n')
[0.9024322446143155, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2, 'clf__n_estimators': 30}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]

[0.9000463284688441, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2}, <class 'sklearn.tree.tree.DecisionTreeClassifier'>]

[0.6218438730599953, {'clf__n_neighbors': 9}, <class 'sklearn.neighbors.classification.KNeighborsClassifier'>]
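The `best_score_param_estimators` list sorted above could be built along the following lines: one grid search per candidate classifier, each wrapped in a pipeline with a `'clf'` step. The parameter grids mirror the results printed above, but the scaler step, grid values, and toy data are assumptions, not our exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for the preprocessed accident features
X, y = make_classification(n_samples=200, random_state=0)

# (classifier, parameter grid) pairs; grids echo the tuned parameters above
models = [
    (RandomForestClassifier(random_state=0),
     {'clf__n_estimators': [10, 30],
      'clf__min_samples_leaf': [1], 'clf__min_samples_split': [2]}),
    (DecisionTreeClassifier(random_state=0),
     {'clf__min_samples_leaf': [1], 'clf__min_samples_split': [2]}),
    (KNeighborsClassifier(),
     {'clf__n_neighbors': [3, 9]}),
]

best_score_param_estimators = []
for clf, grid in models:
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    gs = GridSearchCV(pipe, grid, cv=3)
    gs.fit(X, y)
    # Collect [best_score_, best_params_, best_estimator_] for each classifier
    best_score_param_estimators.append((gs.best_score_, gs.best_params_, gs.best_estimator_))
```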

Simple Decision Tree model (adapted from exercise 5)

A simple model with fewer features (only unemployment and transportation types)

In [18]:
# Requires: export_graphviz from sklearn.tree, graph_from_dot_data from pydotplus,
# and Image from IPython.display
dot_data = export_graphviz(pipe_dt.named_steps['DecisionTreeClassifier'],
                           filled=True, 
                           rounded=True,
                           feature_names=feature_value_names,
                           out_file=None) 

graph = graph_from_dot_data(dot_data) 


Image(graph.create_png()) 
Out[18]:

Conclusion and future work

Since we were interested in location, transportation type, time, and unemployment, the models applied in this project were mainly constrained to those features. For future work, we could incorporate other possible determinants, such as immigration rate, income level, death rate, and other intriguing factors to further improve model accuracy.

Additionally, we can explore other models such as GaussianNB and/or MLP to compare the results.
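Such a comparison could start from a sketch like the one below, which cross-validates GaussianNB and an MLP on the same data; the toy data, fold count, and MLP settings are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the preprocessed accident features
X, y = make_classification(n_samples=300, random_state=0)

# Compare mean cross-validated accuracy of the two candidate models
results = {}
for clf in (GaussianNB(), MLPClassifier(max_iter=500, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=5)
    results[type(clf).__name__] = scores.mean()
```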